1 Basic Concepts

In survival analysis we analyse time-to-event data and try to estimate the underlying distribution of times for events to occur.

1.1 Introduction

For the purposes of this workshop, we focus on a single event type, but the topic is wide and broadly applicable.

1.1.1 Workshop Materials

All materials for this workshop are available in my standard GitHub repo:

https://github.com/kaybenleroll/dublin_r_workshops

The content of this workshop is partly based on the book “Applied Survival Analysis Using R” by Dirk F. Moore. The data from the book is available from CRAN via the asaur package, and there is also a GitHub repo for the code in the book:

https://github.com/kolaczyk/sand

1.1.2 Example Datasets

This workshop will use a number of different time-to-event datasets, so we look at those first.

1.1.2.1 Telco Churn Data

The telco churn data contains usage information for mobile phone customers, along with whether or not they cancelled their subscription accounts.

telco_churn_tbl %>% glimpse()
## Observations: 5,000
## Variables: 21
## $ state            <chr> "KS", "OH", "NJ", "OH", "OK", "AL", "MA", "MO", "L...
## $ accountdur       <dbl> 128, 107, 137, 84, 75, 118, 121, 147, 117, 141, 65...
## $ areacode         <chr> "415", "415", "415", "408", "415", "510", "510", "...
## $ phonenumber      <chr> "382-4657", "371-7191", "358-1921", "375-9999", "3...
## $ intlplan         <chr> "no", "no", "no", "yes", "yes", "yes", "no", "yes"...
## $ vmailplan        <chr> "yes", "yes", "no", "no", "no", "no", "yes", "no",...
## $ vmailmessage     <dbl> 25, 26, 0, 0, 0, 0, 24, 0, 0, 37, 0, 0, 0, 0, 0, 0...
## $ daymins          <dbl> 265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218.2, 1...
## $ daycalls         <dbl> 110, 123, 114, 71, 113, 98, 88, 79, 97, 84, 137, 1...
## $ daycharge        <dbl> 45.07, 27.47, 41.38, 50.90, 28.34, 37.98, 37.09, 2...
## $ evemins          <dbl> 197.4, 195.5, 121.2, 61.9, 148.3, 220.6, 348.5, 10...
## $ evecalls         <dbl> 99, 103, 110, 88, 122, 101, 108, 94, 80, 111, 83, ...
## $ evecharge        <dbl> 16.78, 16.62, 10.30, 5.26, 12.61, 18.75, 29.62, 8....
## $ nightmins        <dbl> 244.7, 254.4, 162.6, 196.9, 186.9, 203.9, 212.6, 2...
## $ nightcalls       <dbl> 91, 103, 104, 89, 121, 118, 118, 96, 90, 97, 111, ...
## $ nightcharge      <dbl> 11.01, 11.45, 7.32, 8.86, 8.41, 9.18, 9.57, 9.53, ...
## $ intlmins         <dbl> 10.0, 13.7, 12.2, 6.6, 10.1, 6.3, 7.5, 7.1, 8.7, 1...
## $ intlcalls        <dbl> 3, 3, 5, 7, 3, 6, 7, 6, 4, 5, 6, 5, 2, 5, 6, 9, 4,...
## $ intlcharge       <dbl> 2.70, 3.70, 3.29, 1.78, 2.73, 1.70, 2.03, 1.92, 2....
## $ custservicecalls <dbl> 1, 1, 0, 2, 3, 0, 3, 0, 1, 0, 4, 0, 1, 3, 4, 4, 1,...
## $ churned          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
telco_churn_tbl %>% head()
## # A tibble: 6 x 21
##   state accountdur areacode phonenumber intlplan vmailplan vmailmessage daymins
##   <chr>      <dbl> <chr>    <chr>       <chr>    <chr>            <dbl>   <dbl>
## 1 KS           128 415      382-4657    no       yes                 25    265.
## 2 OH           107 415      371-7191    no       yes                 26    162.
## 3 NJ           137 415      358-1921    no       no                   0    243.
## 4 OH            84 408      375-9999    yes      no                   0    299.
## 5 OK            75 415      330-6626    yes      no                   0    167.
## 6 AL           118 510      391-8027    yes      no                   0    223.
## # ... with 13 more variables: daycalls <dbl>, daycharge <dbl>, evemins <dbl>,
## #   evecalls <dbl>, evecharge <dbl>, nightmins <dbl>, nightcalls <dbl>,
## #   nightcharge <dbl>, intlmins <dbl>, intlcalls <dbl>, intlcharge <dbl>,
## #   custservicecalls <dbl>, churned <lgl>

1.1.2.2 Prostate Survival

The prostate survival data is simulated data modelled on a cancer survival study.

prostate_surv_tbl %>% glimpse()
## Observations: 14,294
## Variables: 5
## $ grade    <fct> mode, mode, poor, mode, mode, poor, poor, mode, mode, poor...
## $ stage    <fct> T1c, T1ab, T1c, T2, T1c, T2, T1c, T1ab, T1c, T2, T1c, T1c,...
## $ ageGroup <fct> 80+, 75-79, 75-79, 70-74, 70-74, 75-79, 80+, 80+, 75-79, 7...
## $ survTime <int> 18, 23, 37, 27, 42, 38, 18, 78, 47, 13, 81, 23, 21, 13, 11...
## $ status   <int> 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
prostate_surv_tbl %>% head()
## # A tibble: 6 x 5
##   grade stage ageGroup survTime status
##   <fct> <fct> <fct>       <int>  <int>
## 1 mode  T1c   80+            18      0
## 2 mode  T1ab  75-79          23      0
## 3 poor  T1c   75-79          37      0
## 4 mode  T2    70-74          27      0
## 5 mode  T1c   70-74          42      0
## 6 poor  T2    75-79          38      2

1.1.2.3 Smoker Data

The smoker data comes from a smoking relapse study involving a number of different treatments, with details on when, and whether, each patient resumed smoking.

pharmaco_smoker_tbl %>% glimpse()
## Observations: 125
## Variables: 14
## $ id             <int> 21, 113, 39, 80, 87, 29, 16, 35, 54, 70, 84, 85, 25,...
## $ ttr            <int> 182, 14, 5, 16, 0, 182, 14, 77, 2, 0, 12, 182, 21, 3...
## $ relapse        <int> 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0...
## $ grp            <fct> patchOnly, patchOnly, combination, combination, comb...
## $ age            <int> 36, 41, 25, 54, 45, 43, 66, 78, 40, 38, 64, 51, 37, ...
## $ gender         <fct> Male, Male, Female, Male, Male, Male, Male, Female, ...
## $ race           <fct> white, white, white, white, white, hispanic, black, ...
## $ employment     <fct> ft, other, other, ft, other, ft, pt, other, ft, ft, ...
## $ yearsSmoking   <int> 26, 27, 12, 39, 30, 30, 54, 56, 25, 23, 30, 35, 23, ...
## $ levelSmoking   <fct> heavy, heavy, heavy, heavy, heavy, heavy, heavy, lig...
## $ ageGroup2      <fct> 21-49, 21-49, 21-49, 50+, 21-49, 21-49, 50+, 50+, 21...
## $ ageGroup4      <fct> 35-49, 35-49, 21-34, 50-64, 35-49, 35-49, 65+, 65+, ...
## $ priorAttempts  <int> 0, 3, 3, 0, 0, 2, 0, 10, 4, 10, 12, 1, 5, 6, 5, 2, 1...
## $ longestNoSmoke <int> 0, 90, 21, 0, 0, 1825, 0, 15, 7, 90, 365, 7, 1095, 1...
pharmaco_smoker_tbl %>% head()
## # A tibble: 6 x 14
##      id   ttr relapse grp     age gender race  employment yearsSmoking
##   <int> <int>   <int> <fct> <int> <fct>  <fct> <fct>             <int>
## 1    21   182       0 patc…    36 Male   white ft                   26
## 2   113    14       1 patc…    41 Male   white other                27
## 3    39     5       1 comb…    25 Female white other                12
## 4    80    16       1 comb…    54 Male   white ft                   39
## 5    87     0       1 comb…    45 Male   white other                30
## 6    29   182       0 comb…    43 Male   hisp… ft                   30
## # ... with 5 more variables: levelSmoking <fct>, ageGroup2 <fct>,
## #   ageGroup4 <fct>, priorAttempts <int>, longestNoSmoke <int>

1.2 Basic Principles

1.2.1 Data Censoring and Truncation

In the problems we work on, our data observation process does not allow us to fully observe the variable in question. Instead, our observations are often right-censored - that is, we know the value in question is at least the observed value, but it may also be higher.

Expressing this formally, suppose \(T^*\) is the time to event and \(U\) is the time to censoring; then our observed time \(T\) and censoring indicator \(\delta\) are given by

\[\begin{eqnarray*} T &=& \min(T^*, U) \\ \delta &=& I[T^* \leq U] \end{eqnarray*}\]
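As a quick illustration of this mechanism, here is a small base R sketch using made-up event and censoring times:

```r
# Hypothetical true event times and censoring times (illustrative values only)
t_star <- c(5, 12, 3)   # T*: true time to event
u      <- c(10, 8, 6)   # U:  time to censoring

obs_time <- pmin(t_star, u)          # T = min(T*, U)
delta    <- as.integer(t_star <= u)  # delta = I[T* <= U]

obs_time
## [1] 5 8 3
delta
## [1] 1 0 1
```

The second observation is right-censored: all we know is that its event time exceeds 8.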

A less common phenomenon is left-censoring, where we know only that the event occurred at or before a given time. This may happen in medical studies where continuous observation of the patients is not feasible.

1.2.2 Hazard and Survival Functions

Our key goal is to find the survival distribution - the distribution of times from some given start point to the time of the event.

Two common ways of specifying this distribution are the survival function, \(S(t)\), and the hazard function, \(h(t)\).

\(S(t)\) is the probability of surviving to time \(t\), so is defined as follows:

\[ S(t) = P(T > t), \, 0 \leq t \leq \infty \]

Thus, \(S(0) = 1\), and \(S(t)\) decreases monotonically with time. As a probability, it is also non-negative.

We can also define the survival function in terms of the instantaneous failure rate: the rate of failure at time \(t\), conditional on survival up to time \(t\). More formally,

\[ \lambda(t) = \lim_{\delta \to 0} \frac{P(t \leq T \leq t + \delta \; | \, T > t)}{\delta} \]

This is also called the intensity function or the force of mortality.
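We can check this definition numerically: for an exponential distribution the hazard is constant, so the limit above should recover the rate. A sketch with an assumed rate of 0.25:

```r
# Exponential(rate = lambda) lifetimes have constant hazard lambda
lambda <- 0.25
S <- function(t) exp(-lambda * t)   # survival function

t     <- 2
delta <- 1e-6

# Approximate P(t <= T < t + delta | T > t) / delta for small delta
approx_hazard <- (S(t) - S(t + delta)) / (delta * S(t))

round(approx_hazard, 4)
## [1] 0.25
```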

1.2.3 Cumulative Functions

Some analyses are easier to express in terms of the cumulative hazard function

\[ \Lambda(t) = \int^t_0 \lambda(u) \, du \]

As a consequence of this definition we also have

\[ S(t) = \exp(- \Lambda(t)) \]
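A numerical check of this identity, using a Weibull-type hazard \(\lambda(t) = k \, t^{k-1}\) as an assumed example (its cumulative hazard is \(t^k\)):

```r
# Weibull hazard with shape k and unit scale: Lambda(t) = t^k
k      <- 1.5
step   <- 1e-4
t_grid <- seq(0, 5, by = step)

lam    <- k * t_grid^(k - 1)     # hazard function lambda(t)
Lambda <- cumsum(lam) * step     # crude numerical integration of the hazard
S_num  <- exp(-Lambda)           # survival via S(t) = exp(-Lambda(t))
S_true <- exp(-t_grid^k)         # analytic survival function

max(abs(S_num - S_true))         # small numerical integration error
```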

1.2.4 Mean and Median Survival

Finally, two common statistics for survival are the mean survival time and the median survival time:

The mean survival time is the expected value of the distribution, using standard probability definitions:

\[ \mu = E(T) = \int^\infty_0 t f(t) \, dt = \int^\infty_0 S(t) \, dt \]

The median survival time is the value \(t_{\text{med}}\) such that

\[ S(t_{\text{med}}) = 0.5 \]
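For the exponential distribution with rate \(\lambda\), these work out to \(\mu = 1/\lambda\) and \(t_{\text{med}} = \log(2)/\lambda\). A numerical sketch with an assumed rate of 0.1:

```r
lambda <- 0.1
step   <- 0.01
t_grid <- seq(0, 200, by = step)
S      <- exp(-lambda * t_grid)

mean_surv   <- sum(S) * step                 # approximates the integral of S(t) dt
median_surv <- t_grid[min(which(S <= 0.5))]  # first time S(t) drops to 0.5

c(mean_surv, 1 / lambda)           # both approximately 10
c(median_surv, log(2) / lambda)    # both approximately 6.93
```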

1.3 Example Distributions

Let’s look at some different survival distributions.

1.3.1 Constant Hazard

t_seq <- 1:100

h_const_seq <- rep_along(t_seq, 0.01)
S_const_seq <- cumprod(c(1, (1 - h_const_seq)))


hazard_plot <- ggplot() +
    geom_line(aes(x = t_seq, y = h_const_seq)) +
    expand_limits(y = 0) +
    xlab("Time") +
    ylab("Instantaneous Hazard")

cuml_plot <- ggplot() +
    geom_line(aes(x = c(0, t_seq), y = S_const_seq)) +
    expand_limits(y = 0) +
    xlab("Time") +
    ylab("Cumulative Survival")

plot_grid(hazard_plot, cuml_plot, nrow = 2)

If we increase the constant hazard, we get a similar shape, but the curve declines faster.

h_const_high_seq <- rep_along(t_seq, 0.03)
S_const_high_seq <- cumprod(c(1, (1 - h_const_high_seq)))


hazard_plot <- ggplot() +
    geom_line(aes(x = t_seq, y = h_const_high_seq)) +
    geom_line(aes(x = t_seq, y = h_const_seq)
             ,colour = 'blue', linetype = 'dashed') +
    expand_limits(y = 0) +
    xlab("Time") +
    ylab("Instantaneous Hazard")

cuml_plot <- ggplot() +
    geom_line(aes(x = c(0, t_seq), y = S_const_high_seq)) +
    geom_line(aes(x = c(0, t_seq), y = S_const_seq)
             ,colour = 'blue', linetype = 'dashed') +
    expand_limits(y = 0) +
    xlab("Time") +
    ylab("Cumulative Survival")

plot_grid(hazard_plot, cuml_plot, nrow = 2)

1.3.2 Early Hazard

early_seq <- seq(0.10, 0.01, by = -0.01)
late_seq  <- rep(0.01, 100 - length(early_seq))

h_early_seq <- c(early_seq, late_seq)
S_early_seq <- cumprod(c(1, (1 - h_early_seq)))


hazard_plot <- ggplot() +
    geom_line(aes(x = t_seq, y = h_early_seq)) +
    expand_limits(y = 0) +
    xlab("Time") +
    ylab("Instantaneous Hazard")

cuml_plot <- ggplot() +
    geom_line(aes(x = c(0, t_seq), y = S_early_seq)) +
    expand_limits(y = 0) +
    xlab("Time") +
    ylab("Cumulative Survival")

plot_grid(hazard_plot, cuml_plot, nrow = 2)

1.3.3 Late Hazard

late_seq  <- seq(0.01, 0.10, by = 0.0015)
early_seq <- rep(0.01, 100 - length(late_seq))

h_late_seq <- c(early_seq, late_seq)
S_late_seq <- cumprod(c(1, (1 - h_late_seq)))


hazard_plot <- ggplot() +
    geom_line(aes(x = t_seq, y = h_late_seq)) +
    expand_limits(y = 0) +
    xlab("Time") +
    ylab("Instantaneous Hazard")

cuml_plot <- ggplot() +
    geom_line(aes(x = c(0, t_seq), y = S_late_seq)) +
    expand_limits(y = 0) +
    xlab("Time") +
    ylab("Cumulative Survival")

plot_grid(hazard_plot, cuml_plot, nrow = 2)

2 Estimations of the Survival Functions

2.1 The Kaplan-Meier Estimator

Kaplan-Meier is the standard method for estimating the survival function of a given dataset. Formally, it is defined as follows

\[ \hat{S}(t) = \prod_{t_i \leq t} (1 - \hat{q}_i) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right) \]

where \(n_i\) is the number of subjects at risk just before time \(t_i\), and \(d_i\) is the number of individuals who fail at that time.
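To make the product above concrete, here is a minimal base R implementation (a sketch for intuition only; in practice we use survfit(), as below):

```r
# Minimal Kaplan-Meier estimator (no confidence intervals)
km_estimate <- function(time, event) {
  event_times <- sort(unique(time[event == 1]))
  surv <- cumprod(vapply(event_times, function(t_i) {
    n_i <- sum(time >= t_i)               # number at risk just before t_i
    d_i <- sum(time == t_i & event == 1)  # number of failures at t_i
    1 - d_i / n_i
  }, numeric(1)))
  data.frame(time = event_times, estimate = surv)
}

km_estimate(time = c(7, 6, 6, 5, 2, 4), event = c(0, 1, 0, 0, 1, 1))
##   time  estimate
## 1    2 0.8333333
## 2    4 0.6666667
## 3    6 0.4444444
```

These values match the survival column in the survfit() summary shown later.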

2.1.1 Using Kaplan-Meier

In R, we construct KM estimates using the survfit() function from the survival package.

Before we move on to our datasets, we start with a small set of data.

tt   <- c(7, 6, 6, 5, 2, 4)
cens <- c(0, 1, 0, 0, 1, 1)
grp  <- c(0, 0, 0, 1, 1, 1)

Surv(tt, cens)
## [1] 7+ 6  6+ 5+ 2  4
sample_tbl <- tibble(tt = tt, cens = cens, grp = grp)

example_km <- survfit(Surv(tt, cens) ~ 1, data = sample_tbl, conf.type = 'log-log')

plot(example_km)

The base plotting routines are worth trying, but the survminer package provides specialised survival plots built on top of ggplot2.

ggsurvplot(example_km)

Printing out the ‘fitted’ object gives us some basic statistics:

example_km %>% print()
## Call: survfit(formula = Surv(tt, cens) ~ 1, data = sample_tbl, conf.type = "log-log")
## 
##       n  events  median 0.95LCL 0.95UCL 
##       6       3       6       2      NA

We get more details from the summary() function:

example_km %>% summary()
## Call: survfit(formula = Surv(tt, cens) ~ 1, data = sample_tbl, conf.type = "log-log")
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     2      6       1    0.833   0.152       0.2731        0.975
##     4      5       1    0.667   0.192       0.1946        0.904
##     6      3       1    0.444   0.222       0.0662        0.785

2.1.1.1 Exercises

  1. Construct the KM estimator for the telco churn data
  2. What is the median survival time for this data?
  3. What is the mean survival time?
  4. Repeat the above for the other two datasets.

2.1.2 Follow-up Time

A useful quantity is how long the observation period lasts, something that can be subtly difficult to measure.

One method is to switch the censoring labels: that is, we treat the original event as censoring the observation period in the study.

sample_tbl <- sample_tbl %>%
    mutate(follow = 1 - cens)

follow_km <- survfit(Surv(tt, follow) ~ 1, data = sample_tbl, conf.type = 'log-log')

ggsurvplot(follow_km)

follow_km %>% summary()
## Call: survfit(formula = Surv(tt, follow) ~ 1, data = sample_tbl, conf.type = "log-log")
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     5      4       1     0.75   0.217       0.1279        0.961
##     6      3       1     0.50   0.250       0.0578        0.845
##     7      1       1     0.00     NaN           NA           NA

2.2 Smoothed Hazard Functions

An empirical estimate of the hazard function is given by

\[ \hat{\lambda}(t_i) = \frac{d_i}{n_i} \]

This estimate is very noisy, as each ratio is based on only the handful of failures at a single time point.
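Computing these raw ratios for the small six-observation sample shows how jumpy the estimate is (a base R sketch):

```r
time  <- c(7, 6, 6, 5, 2, 4)
event <- c(0, 1, 0, 0, 1, 1)

event_times <- sort(unique(time[event == 1]))
raw_hazard  <- vapply(event_times, function(t_i) {
  sum(time == t_i & event == 1) / sum(time >= t_i)   # d_i / n_i
}, numeric(1))

round(raw_hazard, 3)
## [1] 0.167 0.200 0.333
```

A single extra failure at any time point would move its ratio substantially, which motivates the smoothing below.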

2.2.1 The Hazard Estimator

To obtain smoother estimates we use kernel smoothing techniques.

sample_muhaz <- muhaz(sample_tbl$tt, sample_tbl$cens, max.time = 7)

plot(sample_muhaz)

broom has tidying methods for muhaz objects, which allows us to create the plots with ggplot2.

muhaz_tidy_tbl <- sample_muhaz %>% tidy()

muhaz_tidy_tbl %>% glimpse()
## Observations: 101
## Variables: 2
## $ time     <dbl> 0.00, 0.07, 0.14, 0.21, 0.28, 0.35, 0.42, 0.49, 0.56, 0.63...
## $ estimate <dbl> 0.000000000, 0.000000000, 0.000000000, 0.000871904, 0.0034...

We have estimates of the hazard function now and so can plot it.

ggplot(muhaz_tidy_tbl) +
    geom_line(aes(x = time, y = estimate)) +
    expand_limits(x = 0, y = 0) +
    xlab("Time") +
    ylab("Estimated Hazard")

2.2.1.1 Exercises

  1. Construct the smoothed estimator for the telco churn data
  2. What time has the highest hazard rate?
  3. What time has the lowest hazard rate?
  4. Repeat the above for the other two datasets.

2.2.2 Boundary Corrections

By default, muhaz() corrects for the boundary on both sides, but we may not always want this. To get estimates without the correction, we set b.cor = 'none' in the function call.

sample_nocorr_muhaz <- muhaz(sample_tbl$tt, sample_tbl$cens, max.time = 7
                            ,b.cor = 'none')

plot(sample_nocorr_muhaz)

sample_nocorr_muhaz_tidy_tbl <- sample_nocorr_muhaz %>% tidy()

ggplot(sample_nocorr_muhaz_tidy_tbl) +
    geom_line(aes(x = time, y = estimate)) +
    expand_limits(x = 0, y = 0) +
    xlab("Time") +
    ylab("Estimated Hazard")

To help ensure these smoothed estimates capture the correct features of the data, we also have the equivalent pehaz() functions, which give histogram-style estimates of the hazard, much as histograms and kernel density estimates are discrete and continuous analogues of one another.

2.2.2.1 Exercises

  1. Construct the non-corrected smoothed estimator for the telco churn data
  2. Repeat the above for the other two datasets.

2.3 Comparing KM and Smoothed Estimates

Now that we have both methods of checking the empirical estimates, it is good to compare them.

Before we do this, we note that Kaplan-Meier estimates the survival function, while muhaz() estimates the hazard function.

Thus we need to convert between the two: the easiest way is to numerically integrate the hazard function to obtain the survival function, and then compare that to the KM estimate.

muhaz_surv_tbl <- muhaz_tidy_tbl %>%
    mutate(dt  = c(0, diff(time))
          ,S_t = exp(-cumsum(estimate * dt))
    )

ggsurvplot(example_km)$plot +
    geom_line(aes(x = time, y = S_t), data = muhaz_surv_tbl)

2.3.1 No Boundary Correction

We also want to compare the results without the boundary correction.

muhaz_nocorr_surv_tbl <- sample_nocorr_muhaz_tidy_tbl %>%
    mutate(dt  = c(0, diff(time))
          ,S_t = exp(-cumsum(estimate * dt))
    )

ggsurvplot(example_km)$plot +
    geom_line(aes(x = time, y = S_t), colour = 'blue', data = muhaz_surv_tbl) +
    geom_line(aes(x = time, y = S_t), data = muhaz_nocorr_surv_tbl)

We see that removing the boundary correction can cause large discrepancies between the smoothed and discrete estimates (at least in this case).

2.4 Estimating Survival Curves for the Telco Churn Data

We now build various Kaplan-Meier estimates: one for all the data, as well as splits of the data along vmailplan (whether or not the account had a voicemail plan) and intlplan (whether or not the account had an international calling plan).

telco_all_km   <- survfit(Surv(accountdur, churned) ~ 1,         data = telco_churn_tbl)
telco_vmail_km <- survfit(Surv(accountdur, churned) ~ vmailplan, data = telco_churn_tbl)
telco_intl_km  <- survfit(Surv(accountdur, churned) ~ intlplan,  data = telco_churn_tbl)

2.4.1 Plotting the Curves

Having created the KM estimates, we first plot the curves to check for differences.

telco_all_km   %>% ggsurvplot(censor = FALSE)

telco_vmail_km %>% ggsurvplot(censor = FALSE)

telco_intl_km  %>% ggsurvplot(censor = FALSE)

The estimators come with upper and lower confidence bounds, and it may be useful to include these when assessing the differences. We will look at a more formal way of discriminating between curves shortly.

telco_all_km   %>% ggsurvplot(censor = FALSE, conf.int = TRUE)

telco_vmail_km %>% ggsurvplot(censor = FALSE, conf.int = TRUE)

telco_intl_km  %>% ggsurvplot(censor = FALSE, conf.int = TRUE)

2.4.2 Combining KM Estimators

This is useful, but it would be better to combine all the curves into a single plot. To do this, we need the tables of estimator data, stacked together.

Fortunately, the broom package is compatible with basic survival analysis routines, so this provides us with some utilities to help with this.

If we need to access the KM data directly, we can use tidy() to get a look:

telco_all_km_data   <- telco_all_km   %>% tidy()
telco_all_km_data   %>% glimpse()
## Observations: 218
## Variables: 8
## $ time      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ n.risk    <dbl> 5000, 4989, 4987, 4979, 4976, 4974, 4972, 4967, 4965, 496...
## $ n.event   <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 2, 0, 1, ...
## $ n.censor  <dbl> 10, 1, 7, 2, 2, 2, 5, 2, 3, 3, 6, 4, 9, 1, 5, 9, 6, 4, 8,...
## $ estimate  <dbl> 0.999800, 0.999600, 0.999399, 0.999198, 0.999198, 0.99919...
## $ std.error <dbl> 0.000200020, 0.000283183, 0.000347001, 0.000400944, 0.000...
## $ conf.high <dbl> 1.000000, 1.000000, 1.000000, 0.999984, 0.999984, 0.99998...
## $ conf.low  <dbl> 0.999408, 0.999045, 0.998720, 0.998414, 0.998414, 0.99841...
telco_vmail_km_data <- telco_vmail_km %>% tidy()
telco_vmail_km_data %>% glimpse()
## Observations: 404
## Variables: 9
## $ time      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 1...
## $ n.risk    <dbl> 3677, 3670, 3669, 3664, 3662, 3660, 3658, 3655, 3654, 365...
## $ n.event   <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, ...
## $ n.censor  <dbl> 6, 0, 4, 1, 2, 2, 3, 1, 3, 2, 4, 6, 1, 5, 9, 4, 4, 5, 3, ...
## $ estimate  <dbl> 0.999728, 0.999456, 0.999183, 0.998911, 0.998911, 0.99891...
## $ std.error <dbl> 0.000271998, 0.000385030, 0.000471756, 0.000545035, 0.000...
## $ conf.high <dbl> 1.000000, 1.000000, 1.000000, 0.999978, 0.999978, 0.99997...
## $ conf.low  <dbl> 0.999195, 0.998702, 0.998260, 0.997844, 0.997844, 0.99784...
## $ strata    <chr> "vmailplan=no", "vmailplan=no", "vmailplan=no", "vmailpla...
telco_intl_km_data  <- telco_intl_km  %>% tidy()
telco_intl_km_data  %>% glimpse()
## Observations: 375
## Variables: 9
## $ time      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ n.risk    <dbl> 4527, 4516, 4515, 4508, 4507, 4505, 4503, 4498, 4496, 449...
## $ n.event   <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, ...
## $ n.censor  <dbl> 10, 1, 6, 1, 2, 2, 5, 2, 3, 3, 6, 3, 9, 1, 3, 8, 6, 4, 7,...
## $ estimate  <dbl> 0.999779, 0.999779, 0.999558, 0.999558, 0.999558, 0.99955...
## $ std.error <dbl> 0.000220921, 0.000220921, 0.000312845, 0.000312845, 0.000...
## $ conf.high <dbl> 1.000000, 1.000000, 1.000000, 1.000000, 1.000000, 1.00000...
## $ conf.low  <dbl> 0.999346, 0.999346, 0.998945, 0.998945, 0.998945, 0.99894...
## $ strata    <chr> "intlplan=no", "intlplan=no", "intlplan=no", "intlplan=no...

We stack the data and create a line plot of each of the survival curves.

stacked_km_tbl <- list(telco_all_km_data %>% mutate(strata = 'ALL')
                      ,telco_vmail_km_data
                      ,telco_intl_km_data
                       ) %>%
    bind_rows()


ggplot(stacked_km_tbl) +
    geom_line(aes(x = time, y = estimate,  colour = strata)) +
    geom_line(aes(x = time, y = conf.high, colour = strata), linetype = 'dashed') +
    geom_line(aes(x = time, y = conf.low,  colour = strata), linetype = 'dashed')

This chart is not as useful as it could be - it is too ‘busy’. We need to simplify.

Instead of including everything, we will look at the international plan curves along with the estimator calculated from the full dataset.

ggplot(stacked_km_tbl %>% filter(grepl('ALL|intl', strata))) +
    geom_line(aes(x = time, y = estimate,  colour = strata)) +
    geom_line(aes(x = time, y = conf.high, colour = strata), linetype = 'dashed') +
    geom_line(aes(x = time, y = conf.low,  colour = strata), linetype = 'dashed')

From this, it looks like customers without international plans are similar to the population as a whole, but those with plans have a much worse churn experience. This type of information will be useful later.

We can do a similar thing for the voice-mail plans.

ggplot(stacked_km_tbl %>% filter(grepl('ALL|vmail', strata))) +
    geom_line(aes(x = time, y = estimate,  colour = strata)) +
    geom_line(aes(x = time, y = conf.high, colour = strata), linetype = 'dashed') +
    geom_line(aes(x = time, y = conf.low,  colour = strata), linetype = 'dashed')

This split is more mixed: those with a voicemail plan appear to have a better churn experience than the population, while those without are similar to the population as a whole.

3 Comparing Curves: The Log-Rank Test

Now that we have ways of constructing different survival curves from our data, we turn our attention to checking whether the differences between them can be explained by random chance.

The log-rank test builds a 2x2 table at each distinct event time \(t_i\):

              Control              Treatment            Total
Failure       \(d_{0i}\)           \(d_{1i}\)           \(d_i\)
Non-failure   \(n_{0i} - d_{0i}\)  \(n_{1i} - d_{1i}\)  \(n_i - d_i\)
Total         \(n_{0i}\)           \(n_{1i}\)           \(n_i\)

From these tables we define the following quantities:

\[\begin{eqnarray*} e_{0i} &=& E(d_{0i}) = \frac{n_{0i} d_i}{n_i} \\ v_{0i} &=& \text{var}(d_{0i}) = \frac{n_{0i} n_{1i} d_i (n_i - d_i)}{n_i^2(n_i - 1)} \\ U_0 &=& \sum_{i=1}^{N} (d_{0i} - e_{0i}) = \sum d_{0i} - \sum e_{0i} \\ V_0 &=& \text{var}(U_0) = \sum v_{0i} \end{eqnarray*}\]

Having defined these variables as above, we can show that

\[ \frac{U_0^2}{V_0} \sim \chi_1^2 \]

This gives us a natural statistical test to compare the two curves.
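These sums can be computed by hand in base R. A minimal two-group sketch, applied to the small six-observation sample from earlier (survdiff(), used below, is the standard tool):

```r
# Minimal two-group log-rank test
logrank <- function(time, event, group) {
  event_times <- sort(unique(time[event == 1]))
  U0 <- 0
  V0 <- 0
  for (t_i in event_times) {
    at_risk <- time >= t_i
    n_i  <- sum(at_risk)
    n_0i <- sum(at_risk & group == 0)
    n_1i <- n_i - n_0i
    d_i  <- sum(time == t_i & event == 1)
    d_0i <- sum(time == t_i & event == 1 & group == 0)

    U0 <- U0 + (d_0i - n_0i * d_i / n_i)          # observed minus expected
    if (n_i > 1) {                                 # variance term v_{0i}
      V0 <- V0 + n_0i * n_1i * d_i * (n_i - d_i) / (n_i^2 * (n_i - 1))
    }
  }
  chisq <- U0^2 / V0
  c(chisq = chisq, p.value = pchisq(chisq, df = 1, lower.tail = FALSE))
}

logrank(time  = c(7, 6, 6, 5, 2, 4),
        event = c(0, 1, 0, 0, 1, 1),
        group = c(0, 0, 0, 1, 1, 1))
# chisq is roughly 2.47 (p ~ 0.12): no evidence of a group difference here
```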

3.1 Comparing Binary Categories

The simplest case is when we are comparing survival data split down binary outcomes, such as whether or not the customer has a voicemail plan or a plan with international calls.

3.1.1 Voicemail Differences

We start by comparing the voicemail plans.

telco_vmail_survdiff <- survdiff(Surv(accountdur, churned) ~ vmailplan
                                ,data = telco_churn_tbl)

telco_vmail_survdiff %>% print()
## Call:
## survdiff(formula = Surv(accountdur, churned) ~ vmailplan, data = telco_churn_tbl)
## 
##                  N Observed Expected (O-E)^2/E (O-E)^2/V
## vmailplan=no  3677      605      523      13.0      50.1
## vmailplan=yes 1323      102      184      36.8      50.1
## 
##  Chisq= 50.1  on 1 degrees of freedom, p= 1e-12
telco_vmail_survdiff %>% glance()
## # A tibble: 1 x 3
##   statistic    df  p.value
##       <dbl> <dbl>    <dbl>
## 1      50.1     1 1.48e-12

Such a small \(p\)-value suggests the differences in the curves are unlikely to be due to chance, which agrees with what we observed in the plots of the curves.

3.1.2 International Calls Differences

Now we compare the differences between plans with and without international calls.

telco_intl_survdiff <- survdiff(Surv(accountdur, churned) ~ intlplan
                               ,data = telco_churn_tbl)

telco_intl_survdiff %>% print()
## Call:
## survdiff(formula = Surv(accountdur, churned) ~ intlplan, data = telco_churn_tbl)
## 
##                 N Observed Expected (O-E)^2/E (O-E)^2/V
## intlplan=no  4527      508    638.6      26.7       277
## intlplan=yes  473      199     68.4     249.3       277
## 
##  Chisq= 277  on 1 degrees of freedom, p= <2e-16
telco_intl_survdiff %>% glance()
## # A tibble: 1 x 3
##   statistic    df p.value
##       <dbl> <dbl>   <dbl>
## 1      277.     1       0

3.2 Comparing Multiple Categories

The approach discussed can be extended to categorical variables with multiple levels. We can use areacode for this, as it has three values:

telco_churn_tbl %>% count(areacode)
## # A tibble: 3 x 2
##   areacode     n
##   <chr>    <int>
## 1 408       1259
## 2 415       2495
## 3 510       1246

Let’s check how it splits the survival curves.

telco_areacode_km <- survfit(Surv(accountdur, churned) ~ areacode
                            ,data = telco_churn_tbl)

telco_areacode_km %>% ggsurvplot(censor = TRUE, conf.int = TRUE)

Now let’s run the log-rank test on this data:

telco_areacode_survdiff <- survdiff(Surv(accountdur, churned) ~ areacode
                                   ,data = telco_churn_tbl)

telco_areacode_survdiff %>% print()
## Call:
## survdiff(formula = Surv(accountdur, churned) ~ areacode, data = telco_churn_tbl)
## 
##                 N Observed Expected (O-E)^2/E (O-E)^2/V
## areacode=408 1259      177      185    0.3357    0.4562
## areacode=415 2495      346      349    0.0236    0.0468
## areacode=510 1246      184      173    0.6667    0.8866
## 
##  Chisq= 1  on 2 degrees of freedom, p= 0.6
telco_areacode_survdiff %>% glance()
## # A tibble: 1 x 3
##   statistic    df p.value
##       <dbl> <dbl>   <dbl>
## 1      1.03     2   0.598

Again, the results of the log-rank test mesh with our visual inspection: a \(p\)-value of 0.6 suggests the differences are likely due to chance, and the three area codes share much the same churn experience.

Using our domain knowledge, this also makes sense: we would not expect phones based in different area codes to show substantial differences, as area codes are too large and broad to capture systematic geographic effects.

3.3 Discretizing Continuous Variables

When we have numeric variables affecting survival, we need to discretize those values to assess their effects.

The regression models discussed later will allow us to tackle these dependencies more directly, but for now we use this simplified approach.

One numeric variable likely to have an effect on churn is custservicecalls - the number of calls made to customer service. We would expect that an account with more service calls is likely to be dissatisfied, so higher call counts should correspond to a higher churn rate.

3.3.1 Discretizing custservicecalls

Having decided to bin the call counts, we need to decide how to do this. One way is to split the data by quantiles, so we try this first.

Should this not prove satisfactory, it may help inform more suitable ways to bin this value.

telco_churn_tbl %>% count(custservicecalls) %>% mutate(cuml = cumsum(n))
## # A tibble: 10 x 3
##    custservicecalls     n  cuml
##               <dbl> <int> <int>
##  1                0  1023  1023
##  2                1  1786  2809
##  3                2  1127  3936
##  4                3   665  4601
##  5                4   252  4853
##  6                5    96  4949
##  7                6    34  4983
##  8                7    13  4996
##  9                8     2  4998
## 10                9     2  5000

We see that this distribution is heavily skewed toward the lower end of the call counts, so ntile() is not effective here, as it will not properly separate the counts. Instead we use cut(), as that gives us more control over the binning.

telco_churn_cat_tbl <- telco_churn_tbl %>%
    mutate(calls_cat = cut(custservicecalls
                          ,breaks = c(0, 1, 2, 5, Inf)
                          ,right = FALSE)
    )

telco_callscat_km <- survfit(Surv(accountdur, churned) ~ calls_cat
                            ,data = telco_churn_cat_tbl)

telco_callscat_km %>% ggsurvplot(censor = FALSE, conf.int = TRUE)

It looks like there is not much difference for lower counts, but there is a worse churn experience when the number of customer service calls is 5 or higher.

telco_callscat_survdiff <- survdiff(Surv(accountdur, churned) ~ calls_cat
                                   ,data = telco_churn_cat_tbl)

telco_callscat_survdiff %>% print()
## Call:
## survdiff(formula = Surv(accountdur, churned) ~ calls_cat, data = telco_churn_cat_tbl)
## 
##                      N Observed Expected (O-E)^2/E (O-E)^2/V
## calls_cat=[0,1)   1023      121    145.1      3.99      5.04
## calls_cat=[1,2)   1786      190    254.7     16.45     25.82
## calls_cat=[2,5)   2044      306    285.4      1.49      2.51
## calls_cat=[5,Inf)  147       90     21.8    213.33    220.88
## 
##  Chisq= 236  on 3 degrees of freedom, p= <2e-16
telco_callscat_survdiff %>% glance()
## # A tibble: 1 x 3
##   statistic    df p.value
##       <dbl> <dbl>   <dbl>
## 1      236.     3       0

4 The Cox Proportional Hazards Model

Up to this point we have focused on empirical and non-parametric approaches to estimating survival and hazard functions.

While useful, these approaches do not work well when we want to consider multiple variables simultaneously. We could construct different KM estimates according to the splits, but splitting along multiple variables rapidly diminishes the amount of data in each cell.

Instead, we start to build regression models to estimate the hazard and survival functions.

4.1 Proportional Hazards

The core idea of the proportional hazards model is that we model the hazard function and construct the survival curve from it.

We assume a reference hazard function of time, \(h_0(t)\), known as the baseline hazard, and that each subject's hazard is proportional to this baseline:

\[ h(t) = \psi \, h_0(t) \]

Implicit in the above statement is that the time-effects of the hazard are built into the hazard function \(h_0(t)\) - there is no time-varying effect in the coefficients in the Cox PH model.

The hazard can be greater or less than this baseline: \(\psi > 1\) means the hazard is greater than the baseline and the survival curve lies below the baseline curve; \(\psi < 1\) means the hazard is less and the survival curve lies above it.
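This ordering of the survival curves follows directly from the relationship between the hazard and the survival function - under proportional hazards the survival curve is a power of the baseline survival curve:

\[ S(t) = \exp \left( -\int_0^t \psi \, h_0(u) \, du \right) = S_0(t)^\psi \]

so \(\psi > 1\) increases the cumulative hazard and pushes the survival curve down at every time point.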

We model this constant factor \(\psi\) with a linear predictor: each covariate \(x_i\) enters the model with a corresponding coefficient \(\beta_i\), and we exponentiate the linear combination so that \(\psi\) is always positive.

\[ \psi = \exp(\beta \, x) = \exp(\beta_1 x_1 + ... + \beta_n x_n) \]

Thus, stated fully, our Cox PH model can be expressed as follows:

\[ h(t | X) = h_0(t) \exp(\beta \, X) \]

As our data is censored we need to modify the standard maximum likelihood approach to estimating the parameters of a model - instead we use 'partial likelihoods' that take the censoring into account.

We will not go into the details, but this method reduces to optimising a function over the parameter space. A side benefit is that, once we have the parameter estimates, we can use them to estimate the baseline hazard as well.
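Though we skip the derivation, the partial likelihood itself has a simple form worth seeing. At each observed event time we compare the hazard of the subject who had the event to the total hazard of all subjects still at risk, and the baseline hazard cancels from the ratio:

\[ L_p(\beta) = \prod_{i : \, \delta_i = 1} \frac{\exp(x_i \, \beta)}{\sum_{j \in R(t_i)} \exp(x_j \, \beta)} \]

where \(R(t_i)\) is the risk set - the subjects still under observation just before time \(t_i\).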

With this in mind, we now turn our attention to actually fitting the models.

4.2 Our First Cox-PH Model

For our first Cox-PH model, we build a model using the voicemail plan variable.

telco_model01_coxph <- coxph(Surv(accountdur, churned) ~ vmailplan
                            ,data = telco_churn_tbl
                             )

telco_model01_coxph %>% summary()
## Call:
## coxph(formula = Surv(accountdur, churned) ~ vmailplan, data = telco_churn_tbl)
## 
##   n= 5000, number of events= 707 
## 
##                coef exp(coef) se(coef)     z Pr(>|z|)    
## vmailplanyes -0.741     0.477    0.107 -6.92  4.7e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##              exp(coef) exp(-coef) lower .95 upper .95
## vmailplanyes     0.477        2.1     0.387     0.588
## 
## Concordance= 0.565  (se = 0.008 )
## Rsquare= 0.011   (max possible= 0.88 )
## Likelihood ratio test= 56.6  on 1 df,   p=5e-14
## Wald test            = 47.8  on 1 df,   p=5e-12
## Score (logrank) test = 50  on 1 df,   p=2e-12

Now that we have built our first survival regression model, how do we assess how good the model is?

4.3 Model Assessment

Model assessment in survival analysis is less straightforward than for standard regression, as the model produces a distribution rather than a single number. We discuss more robust assessments later, but for now we introduce a few quantitative measures of model quality.

4.3.1 R-Squared

The standard measure of model fit is \(R^2\), the proportion of variance explained by the model. This does not quite work for survival models, but we can use a modified \(R^2\) statistic, defined for Cox-PH models as follows:

\[ R^2 = 1 - \left( \frac{l(0)}{l(\hat{\beta})} \right)^{\frac{2}{n}} \]

where \(l(\beta)\) is the partial likelihood function.
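As a rough sketch of this calculation, we can recompute the statistic directly from the partial log-likelihoods stored in a fitted coxph object - the loglik element holds the null and fitted values, and working with the difference of log-likelihoods inside exp() avoids numerical underflow. This is a check on the formula, not part of the modelling workflow:

```r
# Sketch: recompute the pseudo-R^2 for a fitted Cox-PH model from its
# stored partial log-likelihoods.
coxph_pseudo_rsq <- function(coxph_model) {
    loglik_null   <- coxph_model$loglik[1]  # null-model log-likelihood
    loglik_fitted <- coxph_model$loglik[2]  # fitted-model log-likelihood
    n             <- coxph_model$n          # number of observations

    1 - exp((2 / n) * (loglik_null - loglik_fitted))
}
```

Applied to the models we fit in this section, this should reproduce the Rsquare value reported by summary().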

4.3.2 Concordance Index

The concordance index, or \(c\)-index, or \(c\)-statistic, is the proportion of pairwise comparisons correctly ordered by the model. For a pair to be comparable, the subject with the earlier observed time must have had the event - otherwise we cannot tell which event came first.

For each comparable pair we check that the model correctly predicts the ordering of the event times between the two datapoints - a correct prediction adds 1 to the total, an incorrect one adds 0. We then divide the total by the number of comparisons made to get our statistic.

Thus, a concordance index of 0.5 is equivalent to random guessing and serves as our baseline value. A good survival model will have a \(c\)-index between 0.6 and 0.7 - values much higher than that are unlikely given the censored nature of our data.
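The mechanics are easy to sketch by hand. Below is a toy example on made-up data (not the telco dataset) that counts comparable and concordant pairs directly - note this simple version ignores the handling of ties used by the survival package:

```r
# Toy illustration of the c-index calculation on made-up data.
time   <- c(2, 4, 5, 7)            # observed times
status <- c(1, 1, 0, 1)            # 1 = event observed, 0 = censored
risk   <- c(0.9, 0.3, 0.4, 0.2)    # model risk scores: higher = earlier event

concordant <- 0
comparable <- 0

for (i in seq_along(time)) {
    for (j in seq_along(time)) {
        # pair (i, j) is comparable if subject i has an observed event
        # strictly before subject j's time
        if (status[i] == 1 && time[i] < time[j]) {
            comparable <- comparable + 1
            if (risk[i] > risk[j]) concordant <- concordant + 1
        }
    }
}

concordant / comparable    # 4 of the 5 comparable pairs agree: 0.8
```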

4.3.3 Assessing Our First Model

So how good is our model?

telco_model01_coxph %>% summary()
## Call:
## coxph(formula = Surv(accountdur, churned) ~ vmailplan, data = telco_churn_tbl)
## 
##   n= 5000, number of events= 707 
## 
##                coef exp(coef) se(coef)     z Pr(>|z|)    
## vmailplanyes -0.741     0.477    0.107 -6.92  4.7e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##              exp(coef) exp(-coef) lower .95 upper .95
## vmailplanyes     0.477        2.1     0.387     0.588
## 
## Concordance= 0.565  (se = 0.008 )
## Rsquare= 0.011   (max possible= 0.88 )
## Likelihood ratio test= 56.6  on 1 df,   p=5e-14
## Wald test            = 47.8  on 1 df,   p=5e-12
## Score (logrank) test = 50  on 1 df,   p=2e-12
telco_model01_coxph %>%
    glance() %>%
    select(n, nevent, r.squared, concordance, std.error.concordance)
## # A tibble: 1 x 5
##       n nevent r.squared concordance std.error.concordance
##   <int>  <dbl>     <dbl>       <dbl>                 <dbl>
## 1  5000    707    0.0113       0.565               0.00757
telco_model01_coxph %>% tidy()
## # A tibble: 1 x 7
##   term         estimate std.error statistic  p.value conf.low conf.high
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 vmailplanyes   -0.741     0.107     -6.92 4.67e-12   -0.951    -0.531

This simple model with one binary variable gives us a \(c\)-index of 0.565. The \(R^2\) is essentially zero (0.011), and we will watch how the two measures track each other as we iterate on the model-building.

4.4 Adding International Plans

For our second model we use the international plan, fitting it as a single variable before we see what adding both together does.

telco_model02_coxph <- coxph(Surv(accountdur, churned) ~ intlplan
                            ,data = telco_churn_tbl
                             )

telco_model02_coxph %>% summary()
## Call:
## coxph(formula = Surv(accountdur, churned) ~ intlplan, data = telco_churn_tbl)
## 
##   n= 5000, number of events= 707 
## 
##               coef exp(coef) se(coef)    z Pr(>|z|)    
## intlplanyes 1.3023    3.6778   0.0838 15.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##             exp(coef) exp(-coef) lower .95 upper .95
## intlplanyes      3.68      0.272      3.12      4.33
## 
## Concordance= 0.591  (se = 0.009 )
## Rsquare= 0.038   (max possible= 0.88 )
## Likelihood ratio test= 194  on 1 df,   p=<2e-16
## Wald test            = 242  on 1 df,   p=<2e-16
## Score (logrank) test = 278  on 1 df,   p=<2e-16

Model02 has a \(c\)-index of 0.591, so the international plan has more predictive power than the voicemail plan. Note that \(R^2\) increases along with the \(c\)-index - a relationship worth investigating later.

telco_model02_coxph %>%
    glance() %>%
    select(n, nevent, r.squared, concordance, std.error.concordance)
## # A tibble: 1 x 5
##       n nevent r.squared concordance std.error.concordance
##   <int>  <dbl>     <dbl>       <dbl>                 <dbl>
## 1  5000    707    0.0380       0.591               0.00933
telco_model02_coxph %>% tidy()
## # A tibble: 1 x 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 intlplanyes     1.30    0.0838      15.5 1.62e-54     1.14      1.47

Having checked the variables separately, we now combine them into a single model containing both variables. We expect combining both variables to improve our model, but if both are highly correlated this may not be the case.

telco_model03_coxph <- coxph(Surv(accountdur, churned) ~ vmailplan + intlplan
                            ,data = telco_churn_tbl
                             )

telco_model03_coxph %>% summary()
## Call:
## coxph(formula = Surv(accountdur, churned) ~ vmailplan + intlplan, 
##     data = telco_churn_tbl)
## 
##   n= 5000, number of events= 707 
## 
##                 coef exp(coef) se(coef)     z Pr(>|z|)    
## vmailplanyes -0.7676    0.4641   0.1071 -7.16  7.8e-13 ***
## intlplanyes   1.3199    3.7429   0.0838 15.75  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##              exp(coef) exp(-coef) lower .95 upper .95
## vmailplanyes     0.464      2.155     0.376     0.573
## intlplanyes      3.743      0.267     3.176     4.411
## 
## Concordance= 0.643  (se = 0.01 )
## Rsquare= 0.05   (max possible= 0.88 )
## Likelihood ratio test= 255  on 2 df,   p=<2e-16
## Wald test            = 294  on 2 df,   p=<2e-16
## Score (logrank) test = 332  on 2 df,   p=<2e-16

The combined model, Model03, has a \(c\)-index of 0.643 and an \(R^2\) of 0.05.

telco_model03_coxph %>%
    glance() %>%
    select(n, nevent, r.squared, concordance, std.error.concordance)
## # A tibble: 1 x 5
##       n nevent r.squared concordance std.error.concordance
##   <int>  <dbl>     <dbl>       <dbl>                 <dbl>
## 1  5000    707    0.0497       0.643               0.00969
telco_model03_coxph %>% tidy()
## # A tibble: 2 x 7
##   term         estimate std.error statistic  p.value conf.low conf.high
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 vmailplanyes   -0.768    0.107      -7.16 7.80e-13   -0.978    -0.558
## 2 intlplanyes     1.32     0.0838     15.8  6.82e-56    1.16      1.48

This model appears to be better, and there are many more variables we can try to improve the model.

Before we spend time and effort iterating further on this model, we now spend some time checking the proportional hazards assumption. As we might expect, we have tools to help with this.

4.5 Validating the Proportional Hazards Model

We have skipped much of the statistical theory up to this point, but we now need to look at using some of it.

A common statistical approach for assessing regression models is through analysis of the residuals, typically defined as some form of difference between the predicted and observed values.

For survival analysis, our approach is more involved in terms of how we define the residuals of a model, but the basic concept is the same.

4.5.1 Fitting the Null Model

We first fit a Cox-PH model with no covariates - the 'null' model:

telco_model00_coxph <- coxph(Surv(accountdur, churned) ~ 1
                            ,data = telco_churn_tbl
                             )

telco_model00_coxph %>% summary()
## Call:  coxph(formula = Surv(accountdur, churned) ~ 1, data = telco_churn_tbl)
## 
## Null model
##   log likelihood= -5293.32 
##   n= 5000

4.5.2 Martingale Residuals

We define our basic residual \(m_i\) as follows:

\[ m_i = \delta_i - \hat{H}_0(t_i) \exp(x_i \hat{\beta}) \]

The residual is the difference between the observed event indicator \(\delta_i\) and its expected value under the Cox model.

telco_model00_resid_m <- telco_model00_coxph %>% residuals(type = 'martingale')

telco_model00_resid_m %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.2069 -0.1805 -0.0704  0.0000 -0.0166  0.9998
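As a sanity check on the formula, we can sketch the same calculation by hand. For the null model \(\exp(x \hat{\beta}) = 1\), so the residual is just the event indicator minus the baseline cumulative hazard at each subject's observed time. A sketch of this in R, assuming churned is the 0/1 event indicator used in the Surv() calls above:

```r
# Sketch: rebuild the null-model martingale residuals by hand. basehaz()
# returns the Breslow estimate of the cumulative hazard at the event times;
# we look up each subject's value with a right-continuous step function.
base_haz_tbl <- basehaz(telco_model00_coxph, centered = FALSE)

cumhaz_lookup <- approx(x      = base_haz_tbl$time
                       ,y      = base_haz_tbl$hazard
                       ,xout   = telco_churn_tbl$accountdur
                       ,method = 'constant'
                       ,yleft  = 0
                       ,rule   = 2
                        )$y

manual_resid_m <- telco_churn_tbl$churned - cumhaz_lookup
```

These values should closely match the output of residuals(type = 'martingale') above, up to differences in how tied event times are handled.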

We have zero mean, but there is skew in the data as the values are bounded above at 1 but unbounded below. The easiest way to check the shape is to look at a histogram of the values.

ggplot() +
    geom_histogram(aes(x = telco_model00_resid_m), bins = 50) +
    xlab('Residual Value') +
    ylab('Count') +
    ggtitle('Histogram of Null-Model Martingale Residuals')

ggplot() +
    geom_histogram(aes(x = telco_model00_resid_m^2), bins = 50) +
    xlab('Squared Residual Value') +
    ylab('Count') +
    ggtitle('Histogram of Null-Model Squared Martingale Residuals')

4.5.3 Deviance Residuals

Martingale residuals cannot be used in sums-of-squares methods to assess goodness of fit - instead we define another quantity that, when the model is correct, is symmetrically distributed with an expectation of zero. These 'deviance' residuals, \(d_i\), are defined as follows:

\[ d_i = \text{sign}(m_i) \left\{ -2 \left[ m_i + \delta_i \log(\delta_i - m_i) \right] \right\}^{1/2} \]

telco_model00_resid_d <- telco_model00_coxph %>% residuals(type = 'deviance')

telco_model00_resid_d %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -1.554  -0.601  -0.375  -0.159  -0.182   3.877
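To connect the two residual types, we can sketch the deviance calculation by applying the formula directly to the martingale residuals (again assuming churned is the 0/1 event indicator). The guard avoids taking the log of zero for censored observations, where the \(\delta_i \log(\cdot)\) term vanishes anyway:

```r
# Sketch: derive the deviance residuals from the martingale residuals.
delta <- telco_churn_tbl$churned
m     <- telco_model00_resid_m

# the delta_i * log(delta_i - m_i) term is zero whenever delta_i = 0
log_term <- ifelse(delta == 1, log(delta - m), 0)

manual_resid_d <- sign(m) * sqrt(-2 * (m + delta * log_term))
```

The result should match the output of residuals(type = 'deviance') above.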

The mean is small, but non-zero. We inspect the distribution of values to see what it looks like.

ggplot() +
    geom_histogram(aes(x = telco_model00_resid_d), bins = 50) +
    xlab('Residual Value') +
    ylab('Count') +
    ggtitle('Histogram of Null-Model Deviance Residuals')

ggplot() +
    geom_histogram(aes(x = telco_model00_resid_d^2), bins = 50) +
    xlab('Squared Residual Value') +
    ylab('Count') +
    ggtitle('Histogram of Null-Model Squared Deviance Residuals')

Neither looks distributed as we would like, suggesting this model is not good, but we will return to this topic later.

4.5.4 Residual Plots

Much like for linear models, these residuals are useful to help decide which variables may have predictive power in our survival model. We do this by plotting the values of the residual against the variable values.

To do this, we first add the residuals to the original dataset so that we can construct some of the residual plots.

telco_null_assess_tbl <- telco_churn_tbl %>%
    mutate(resid_m = telco_model00_resid_m
          ,resid_d = telco_model00_resid_d
           )

Now that we have put the table together we can construct some two-way plots of the variable against the martingale residual:

ggplot(telco_null_assess_tbl) +
    geom_boxplot(aes(x = vmailplan, y = resid_m)) +
    xlab('vmailplan') +
    ylab('Martingale Residual') +
    ggtitle('Residual Plot of Martingale Residual against vmailplan')

ggplot(telco_null_assess_tbl) +
    geom_boxplot(aes(x = intlplan, y = resid_m)) +
    xlab('intlplan') +
    ylab('Martingale Residual') +
    ggtitle('Residual Plot of Martingale Residual against intlplan')

ggplot(telco_null_assess_tbl) +
    geom_boxplot(aes(x = areacode, y = resid_m)) +
    xlab('areacode') +
    ylab('Martingale Residual') +
    ggtitle('Residual Plot of Martingale Residual against areacode')

ggplot(telco_null_assess_tbl) +
    geom_boxplot(aes(x = custservicecalls, y = resid_m, group = custservicecalls)) +
    xlab('custservicecalls') +
    ylab('Martingale Residual') +
    ggtitle('Residual Plot of Martingale Residual against custservicecalls')

custservicecalls is a numeric variable, but it takes only a small number of values, so we can use boxplots rather than the more typical scatterplot.

For continuous variables, we look at those scatterplots:

ggplot(telco_null_assess_tbl) +
    geom_point(aes(x = vmailmessage, y = resid_m), size = 1, alpha = 0.1) +
    xlab('vmailmessage') +
    ylab('Martingale Residual') +
    ggtitle('Residual Plot of Martingale Residual against vmailmessage')

ggplot(telco_null_assess_tbl) +
    geom_point(aes(x = daymins, y = resid_m), size = 1, alpha = 0.1) +
    xlab('daymins') +
    ylab('Martingale Residual') +
    ggtitle('Residual Plot of Martingale Residual against daymins')

ggplot(telco_null_assess_tbl) +
    geom_point(aes(x = nightmins, y = resid_m), size = 1, alpha = 0.1) +
    xlab('nightmins') +
    ylab('Martingale Residual') +
    ggtitle('Residual Plot of Martingale Residual against nightmins')

ggplot(telco_null_assess_tbl) +
    geom_point(aes(x = intlmins, y = resid_m), size = 1, alpha = 0.1) +
    xlab('intlmins') +
    ylab('Martingale Residual') +
    ggtitle('Residual Plot of Martingale Residual against intlmins')

No strong patterns stand out here, but that does not necessarily rule these variables out of candidate models. We will not know their predictive power until we try them in a model.

4.6 Assessing the Proportional Hazards Assumption

The proportional hazards assumption is key in the construction of the partial likelihood, as it allows the cancellation of the baseline hazard in the calculations.

This assumption of proportional hazards is likely to be violated in practice, so the question is how big an effect those violations have on our model. Formal statistical tests are thus of limited value, but they can be a useful indicator of issues with any models we build.

4.6.1 Residual Plots

Tests for the proportional hazards assumption reduce to checking whether the parameters of the model are constant over time. This task is made easier by the construction of 'Schoenfeld' residuals, which we will not discuss in detail here, but which are easily calculated in R using the cox.zph() function.

telco_model01_coxzph <- telco_model01_coxph %>% cox.zph(transform = 'km')

telco_model01_coxzph %>% print()
##                 rho chisq     p
## vmailplanyes 0.0597  2.52 0.112
telco_model01_coxzph %>% ggcoxzph()

The plot and \(p\)-value suggest no serious violation of the proportional hazards assumption here.

We now move on to the second model we built.

telco_model02_coxzph <- telco_model02_coxph %>% cox.zph(transform = 'km')

telco_model02_coxzph %>% print()
##                rho chisq     p
## intlplanyes 0.0269 0.512 0.474
telco_model02_coxzph %>% ggcoxzph()

The assumption also seems sound here.

We now try the combined model, model03:

telco_model03_coxzph <- telco_model03_coxph %>% cox.zph(transform = 'km')

telco_model03_coxzph %>% print()
##                 rho chisq     p
## vmailplanyes 0.0547 2.100 0.147
## intlplanyes  0.0298 0.626 0.429
## GLOBAL           NA 2.780 0.249
telco_model03_coxzph %>% ggcoxzph()

The global test has a \(p\)-value of 0.249, so we see no evidence of a violation in the combined model either.

5 R Environment

devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.1 (2018-07-02)
##  os       Ubuntu 18.10                
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_IE.UTF-8                 
##  ctype    en_IE.UTF-8                 
##  tz       Europe/Dublin               
##  date     2018-11-25                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  asaur       * 0.50    2016-04-12 [1] CRAN (R 3.5.1)
##  assertthat    0.2.0   2017-04-11 [1] CRAN (R 3.5.1)
##  backports     1.1.2   2017-12-13 [1] CRAN (R 3.5.1)
##  base64enc     0.1-3   2015-07-28 [1] CRAN (R 3.5.1)
##  bindr         0.1.1   2018-03-13 [1] CRAN (R 3.5.1)
##  bindrcpp    * 0.2.2   2018-03-29 [1] CRAN (R 3.5.1)
##  broom       * 0.5.0   2018-07-17 [1] CRAN (R 3.5.1)
##  callr         3.0.0   2018-08-24 [1] CRAN (R 3.5.1)
##  cellranger    1.1.0   2016-07-27 [1] CRAN (R 3.5.1)
##  cli           1.0.1   2018-09-25 [1] CRAN (R 3.5.1)
##  cmprsk        2.2-7   2014-06-17 [1] CRAN (R 3.5.1)
##  colorspace    1.3-2   2016-12-14 [1] CRAN (R 3.5.1)
##  cowplot     * 0.9.3   2018-07-15 [1] CRAN (R 3.5.1)
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.1)
##  data.table    1.11.8  2018-09-30 [1] CRAN (R 3.5.1)
##  debugme       1.1.0   2017-10-22 [1] CRAN (R 3.5.1)
##  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.1)
##  devtools      2.0.1   2018-10-26 [1] CRAN (R 3.5.1)
##  digest        0.6.18  2018-10-10 [1] CRAN (R 3.5.1)
##  dplyr       * 0.7.8   2018-11-10 [1] CRAN (R 3.5.1)
##  evaluate      0.12    2018-10-09 [1] CRAN (R 3.5.1)
##  fansi         0.4.0   2018-10-05 [1] CRAN (R 3.5.1)
##  forcats     * 0.3.0   2018-02-19 [1] CRAN (R 3.5.1)
##  fs            1.2.6   2018-08-23 [1] CRAN (R 3.5.1)
##  ggplot2     * 3.1.0   2018-10-25 [1] CRAN (R 3.5.1)
##  ggpubr      * 0.2     2018-11-15 [1] CRAN (R 3.5.1)
##  glue          1.3.0   2018-07-17 [1] CRAN (R 3.5.1)
##  gridExtra     2.3     2017-09-09 [1] CRAN (R 3.5.1)
##  gtable        0.2.0   2016-02-26 [1] CRAN (R 3.5.1)
##  haven         2.0.0   2018-11-22 [1] CRAN (R 3.5.1)
##  hms           0.4.2   2018-03-10 [1] CRAN (R 3.5.1)
##  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.1)
##  httr          1.3.1   2017-08-20 [1] CRAN (R 3.5.1)
##  jsonlite      1.5     2017-06-01 [1] CRAN (R 3.5.1)
##  km.ci         0.5-2   2009-08-30 [1] CRAN (R 3.5.1)
##  KMsurv        0.1-5   2012-12-03 [1] CRAN (R 3.5.1)
##  knitr         1.20    2018-02-20 [1] CRAN (R 3.5.1)
##  labeling      0.3     2014-08-23 [1] CRAN (R 3.5.1)
##  lattice       0.20-38 2018-11-04 [3] CRAN (R 3.5.1)
##  lazyeval      0.2.1   2017-10-29 [1] CRAN (R 3.5.1)
##  lubridate     1.7.4   2018-04-11 [1] CRAN (R 3.5.1)
##  magrittr    * 1.5     2014-11-22 [1] CRAN (R 3.5.1)
##  Matrix        1.2-15  2018-11-01 [1] CRAN (R 3.5.1)
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.1)
##  modelr        0.1.2   2018-05-11 [1] CRAN (R 3.5.1)
##  muhaz       * 1.2.6   2014-08-09 [1] CRAN (R 3.5.1)
##  munsell       0.5.0   2018-06-12 [1] CRAN (R 3.5.1)
##  nlme          3.1-137 2018-04-07 [3] CRAN (R 3.5.0)
##  pillar        1.3.0   2018-07-14 [1] CRAN (R 3.5.1)
##  pkgbuild      1.0.2   2018-10-16 [1] CRAN (R 3.5.1)
##  pkgconfig     2.0.2   2018-08-16 [1] CRAN (R 3.5.1)
##  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.1)
##  plyr          1.8.4   2016-06-08 [1] CRAN (R 3.5.1)
##  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.1)
##  processx      3.2.0   2018-08-16 [1] CRAN (R 3.5.1)
##  ps            1.2.1   2018-11-06 [1] CRAN (R 3.5.1)
##  purrr       * 0.2.5   2018-05-29 [1] CRAN (R 3.5.1)
##  R6            2.3.0   2018-10-04 [1] CRAN (R 3.5.1)
##  Rcpp          1.0.0   2018-11-07 [1] CRAN (R 3.5.1)
##  readr       * 1.2.1   2018-11-22 [1] CRAN (R 3.5.1)
##  readxl        1.1.0   2018-04-20 [1] CRAN (R 3.5.1)
##  remotes       2.0.2   2018-10-30 [1] CRAN (R 3.5.1)
##  rlang         0.3.0.1 2018-10-25 [1] CRAN (R 3.5.1)
##  rmarkdown     1.10    2018-06-11 [1] CRAN (R 3.5.1)
##  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.1)
##  rstudioapi    0.8     2018-10-02 [1] CRAN (R 3.5.1)
##  rvest         0.3.2   2016-06-17 [1] CRAN (R 3.5.1)
##  scales      * 1.0.0   2018-08-09 [1] CRAN (R 3.5.1)
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.1)
##  stringi       1.2.4   2018-07-20 [1] CRAN (R 3.5.1)
##  stringr     * 1.3.1   2018-05-10 [1] CRAN (R 3.5.1)
##  survival    * 2.43-1  2018-10-29 [3] CRAN (R 3.5.1)
##  survminer   * 0.4.3   2018-08-04 [1] CRAN (R 3.5.1)
##  survMisc      0.5.5   2018-07-05 [1] CRAN (R 3.5.1)
##  testthat      2.0.1   2018-10-13 [1] CRAN (R 3.5.1)
##  tibble      * 1.4.2   2018-01-22 [1] CRAN (R 3.5.1)
##  tidyr       * 0.8.2   2018-10-28 [1] CRAN (R 3.5.1)
##  tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.5.1)
##  tidyverse   * 1.2.1   2017-11-14 [1] CRAN (R 3.5.1)
##  usethis       1.4.0   2018-08-14 [1] CRAN (R 3.5.1)
##  utf8          1.1.4   2018-05-24 [1] CRAN (R 3.5.1)
##  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.1)
##  xml2          1.2.0   2018-01-24 [1] CRAN (R 3.5.1)
##  xtable        1.8-3   2018-08-29 [1] CRAN (R 3.5.1)
##  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.1)
##  zoo           1.8-4   2018-09-19 [1] CRAN (R 3.5.1)
## 
## [1] /usr/local/lib/R/site-library
## [2] /usr/lib/R/site-library
## [3] /usr/lib/R/library